Predicting a Country's Population Density

EXECUTIVE SUMMARY

The world’s population grows every second. As of November 2020, it is estimated at around 7.8 billion. With births occurring every second, overpopulation is a real possibility in many areas of the world, so it is crucial for a country to understand the statistics behind its population. The question this research addresses is: can we predict a country’s population density using the demographic category and coordinates of a location?

Population density is the number of people per unit area. It describes how many people live in a given area, which can be as small as a household. This research examines the population density values of three countries: the Philippines, Japan, and South Korea.

This research builds several machine learning models to predict population density. The ML models used are as follows:

a. Linear Regressor
b. Decision Tree Regressor
c. Random Forest Regressor

Using Apache Spark on m4.large instances with 8GB of RAM in AWS EMR, the researchers preprocessed and visualized 11 gigabytes of raw data and analyzed the results to draw significant insights. For Japan, South Korea, and the Philippines, it was observed that population density can be predicted from the latitude, longitude, and demographic parameters. It was also observed that, among the ML models used, the Decision Tree Regressor achieved high accuracy for all countries.

INTRODUCTION

For a country, it is crucial to know the size of its population. One of the most important factors in assessing a country’s economic life is its population growth. When many people live in a certain area, the consumption and availability of resources change accordingly. A household with many members will likely use more utilities such as water, food, electricity, and internet than a household with fewer members. In Tokyo, Japan, for example, rent is notoriously high due to the population density of the city. At the same time, the resources available for such utilities decrease, in line with the law of supply and demand. On the other hand, concentrating the population in certain areas leaves nature elsewhere untouched, preserving its beauty for us to appreciate, which is the case in many areas of our country, the Philippines. This can also aid a country's economy by attracting travelers and developing the tourism industry. Countries are therefore keen to monitor the population growth rate in the different areas within their jurisdiction.

One way of examining the effect of population is through population density, the number of people per unit area. It simply refers to the number of people living in an area, for example, a household. Several factors are considered when checking population density. Many countries check whether their population density is too high, which might lead to overconsumption of resources and diminishing supply. Countries can also check whether the population density in some areas is too low, indicating lagging development in sparsely populated areas. Indeed, population density plays a role in economic development, and further studies could explore this aspect [1].

The High Resolution Population Density dataset in the AWS Open Data Registry is a set of population density data for a selection of countries from Facebook Connectivity Lab and Center for International Earth Science Information Network – CIESIN – Columbia University. This dataset estimates the number of people living within 30-meter grid tiles. In this research, we will try to create a machine learning model to predict the population density of each country.

PROBLEM STATEMENT

1. Can we predict a country’s population density by using the CIESIN and Facebook dataset?

BUSINESS VALUE

METHODOLOGY

To properly address the problem, the researchers will use a portion of the Facebook-CIESIN dataset, namely data from select countries: Japan, South Korea, and the Philippines. Some information, such as the country and demographic, is found in the file name, so the corresponding columns will be added during processing. The researchers will follow the general workflow defined below to arrive at a conclusion and recommendations.
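
As a minimal sketch of deriving the country and demographic from a file name: the `<COUNTRY>_<demographic>.csv` naming pattern and the function name `labels_from_filename` are assumptions made for illustration, since the source does not show the actual object names in the bucket.

```python
from pathlib import PurePosixPath
from typing import Tuple


def labels_from_filename(path: str) -> Tuple[str, str]:
    """Split a file name such as 'JPN_women.csv' into (country, demographic).

    The '<COUNTRY>_<demographic>.csv' pattern is an assumption of this
    sketch; adjust the parsing to the real object names.
    """
    name = PurePosixPath(path).name        # drop any directory prefix
    stem = name.split(".", 1)[0]           # drop '.csv', '.csv.gz', ...
    country, _, demographic = stem.partition("_")
    if not country or not demographic:
        raise ValueError(f"unexpected file name: {path!r}")
    return country, demographic
```

The two returned labels can then be attached to every row read from that file.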

[Figure: general workflow of the methodology]

OVERVIEW OF THE METHODOLOGY

Each step will be discussed in detail in the following sections. To give a general overview of the methodology, each step is briefly described below:

1. Data Gathering and Preprocessing

The dataset can be listed at the following path:

    aws s3 ls s3://dataforgood-fb-data/ --no-sign-request

Documentation Link:

    https://dataforgood.fb.com/docs/

Please note that there are more files that can be used for the project under the chosen dataset but to minimize the scope of the research, only files for select countries will be used. The file sizes were checked by downloading and decompressing them locally, arriving at a total of about 11GB.

For this dataset in particular, no cleaning and no transformation was required. Multiple files were merged to arrive at the final complete dataset.

2. Data Description

3. Exploratory Data Analysis

4. Machine Learning Models

5. Interpretation of Results

DATA GATHERING

The data used for the study was sourced from the AWS Open Data Registry and can be found by searching for the dataset titled "High Resolution Population Density Maps + Demographic Estimates by CIESIN and Facebook". It contains almost 27GB of files, many of which are CSV files with coordinate and population density columns. The dataset has population density data for many countries. To limit the scope of this research, we focus only on the following:

1. Philippines
2. South Korea
3. Japan

We will create different machine learning models for each country to try to predict the population density.

Picture of File Size

Picture of Instances with Instance Type and Workers

Picture of Dashboard showing Workers

DATA PREPROCESSING

Since the dataset can be considered big, given the total size of the files and the sheer number of rows and columns, an EMR Spark cluster is used to perform the preprocessing. There are several folders for each of the three countries, and each folder corresponds to a demographic category. To process the data, the files are appended into one consolidated Spark dataframe with dedicated columns for the demographic and the country each row belongs to. In this way, a single dataframe of the whole dataset is obtained.
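
The consolidation step can be sketched in plain Python as follows. This is an illustration of the logic only, not the Spark code used in the study, which the source does not show; in Spark the same idea would typically be expressed with `withColumn`, `lit`, and `unionByName`. The function name `consolidate` and the header names are assumptions.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple


def consolidate(files: Iterable[Tuple[str, str, str]]) -> List[Dict[str, str]]:
    """Append rows from several CSV texts, tagging each row with its source.

    Each item is (country, demographic, csv_text). The csv_text is assumed
    to start with a header row: latitude, longitude, population.
    """
    rows: List[Dict[str, str]] = []
    for country, demographic, csv_text in files:
        for record in csv.DictReader(io.StringIO(csv_text)):
            record["country"] = country          # added column
            record["demographic"] = demographic  # added column
            rows.append(record)
    return rows
```

Every row of the result carries the two added columns, so the per-folder files can be analyzed as one table.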

The data processing infrastructure is composed of a Spark cluster created in AWS EMR, consisting of a master and 3 workers. EC2 instances of type m4.xlarge with 64GB of storage were used.

The commands below connect the notebook to the existing Spark cluster and preprocess the files.

DATA DESCRIPTION

The dataset contains the following columns and their descriptions.

Philippines, Japan, and South Korea Population Density Dataset:

Column Name    Description
latitude       The latitude (EPSG:4326/WGS84) coordinates of the center of the 1-arc-second-by-1-arc-second grid cell
longitude      The longitude (EPSG:4326/WGS84) coordinates of the center of the 1-arc-second-by-1-arc-second grid cell
population     The value is the (statistical) number of people in that grid of coordinates
country        Name of Country
demographic    Demography/Division of Population
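
To relate the 1-arc-second grid cell to the roughly 30-meter tiles mentioned in the introduction, the cell size can be approximated as below. This is a sketch that treats the Earth as a sphere with the WGS84 equatorial radius, so the figures are approximate; the function name is hypothetical.

```python
import math
from typing import Tuple

EARTH_RADIUS_M = 6_378_137.0  # WGS84 equatorial radius in metres


def arc_second_cell_size_m(latitude_deg: float) -> Tuple[float, float]:
    """Approximate (north-south, east-west) size in metres of a 1-arc-second cell."""
    one_arc_second_rad = math.radians(1.0 / 3600.0)
    north_south = EARTH_RADIUS_M * one_arc_second_rad
    # Meridians converge toward the poles, so the east-west extent shrinks
    # with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west
```

Under this approximation a cell is about 31 m from north to south everywhere, while its east-west width narrows away from the equator.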

EXPLORATORY DATA ANALYSIS AND MACHINE LEARNING MODEL

Plotting Philippines, South Korea and Japan

Given the sheer number of rows in the data, a small sample is taken so that we can visualize the information and get a glimpse of it. We check only a fraction of the dataset, around 0.005%, since parsing all of it would take a long time.
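
A sampling step of this kind can be sketched in plain Python as follows. The exact call used in the notebook is not shown in the source; Spark's `DataFrame.sample` takes a fraction rather than a percentage, so 0.005% corresponds to a fraction of 0.00005. The function names and the seed value are assumptions.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def percent_to_fraction(percent: float) -> float:
    """Convert a percentage such as 0.005 (%) into a fraction such as 0.00005."""
    return percent / 100.0


def bernoulli_sample(rows: Sequence[T], fraction: float, seed: int = 42) -> List[T]:
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [row for row in rows if rng.random() < fraction]
```

Because each row is kept independently, the size of the sample varies around `fraction * len(rows)` rather than matching it exactly.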

This visualization shows all countries included in the dataset, with a sample of their respective population densities. Some areas are less populated than others, while the densest areas are usually near the capital city of each country.

A view of each country's population density distribution, plotted using plotly, follows.

Plotting Japan

In Japan, the densest areas aside from Tokyo seem to be Osaka and Nagoya. Aside from that, there seems to be an outlier in the city of Matsue, which shows a dense concentration of women. The population distribution covers the whole country except for some mountainous regions. The densest areas are concentrated along the route of the Shinkansen.

ML Model for Japan

We will use the DataJP dataframe, which was used for the graph above, to create ML models that predict the population density.

Sample head of DataJP

Schema of DataJP

Preprocessing for ML Models

Converting All String Types to Category Types

Converting the Needed Features into a Feature Vector

For this data, we will be using the latitude, longitude and also the demography as the main features for all our ML Models.
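
The two conversion steps can be illustrated in plain Python as below. In Spark ML they would typically be handled by `StringIndexer` and `VectorAssembler`, but the source does not show the cells, so this is a sketch under that assumption; the function names and the frequency-based ordering are illustrative choices.

```python
from collections import Counter
from typing import Dict, List, Sequence


def fit_category_index(values: Sequence[str]) -> Dict[str, int]:
    """Map each label to an integer, most frequent label first.

    Ties are broken alphabetically so the mapping is deterministic.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def to_feature_vector(row: Dict[str, str], index: Dict[str, int]) -> List[float]:
    """Build [latitude, longitude, demographic_index] for one row."""
    return [
        float(row["latitude"]),
        float(row["longitude"]),
        float(index[row["demographic"]]),
    ]
```

Note that an integer index imposes an artificial ordering on the demographic categories, which matters more for a linear model than for tree-based models.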

Scaling the Data using MinMax Scaling
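
Min-max scaling maps each value x to (x - min) / (max - min), so every feature ends up in the range 0 to 1. A minimal plain-Python sketch is given below; in Spark ML this is typically done with `MinMaxScaler`, though the notebook cell itself is not shown in the source. The handling of a constant column is a convention chosen for this sketch.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly to the [0, 1] range.

    A constant column has no spread, so it is mapped to 0.5 here;
    that choice is a convention of this sketch.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Scaling puts latitude, longitude, and the demographic index on a comparable range before they are passed to the models.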

Splitting the Data into Train and Test Data
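
A random split can be sketched in plain Python as follows. The 80/20 ratio and the seed are assumptions, since the source does not state the proportions used; in Spark the equivalent call would typically be `DataFrame.randomSplit`.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(
    rows: Sequence[T], train_fraction: float = 0.8, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle the rows and cut them into a training part and a test part."""
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

The models are fitted on the first part only, and the second part is held back to check how well they generalize.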

Linear Regression Model for JP

We will use the Linear Regression model of pyspark to attempt to create a prediction model.
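
The report later quotes an accuracy and a root mean square error for each model without naming the accuracy metric. Assuming the accuracy figures are R² scores (an assumption, not something the source states), both metrics can be computed as in the sketch below; in Spark ML, `RegressionEvaluator` offers `rmse` and `r2` as metric names.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination, 1 - SS_res / SS_tot.

    Assumes the actual values are not all identical (SS_tot > 0).
    """
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

An R² near 0 means the model does no better than always predicting the mean, which is a useful reference point when reading a low baseline score.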

Decision Tree Regressor for JP

Random Forest Regressor for JP

Plotting South Korea

In South Korea, the densest area aside from Seoul is Busan. Aside from that, there are also population concentrations in Daegu and Gwangju. The population distribution covers the whole country except for the northeast region, which is mountainous. Interestingly, the Korail network also connects the most populated areas in South Korea.

ML Model for South Korea

The processing done on the Japan dataset is also applied to the South Korea data, as shown by the cell below:

Preprocessing

Linear Regressor Model for South Korea

Decision Tree Regressor for South Korea

Random Forest Regressor for South Korea

Plotting Philippines

In the Philippines, the densely populated areas are limited to Metro Manila and Davao. An outline of the country cannot be clearly seen. The densest areas outside of Manila and Davao are mostly composed of men, with the exception of Sulu, which has a concentration of women. Unlike the previous two countries, we do not have a national railroad system. Also, compared to the previous two countries, the disparity between the places where people congregate and the rest of the country is apparent, with the markers in other places almost invisible.

ML Model for PH

Preprocessing

Linear Regression Model for PH

Decision Tree Regressor for PH

Random Forest Regressor for PH

RESULTS AND DISCUSSION

We were able to plot the population density of each country and visualize it clearly. In the first graph, the total population visualization for the Philippines greatly outweighs those for South Korea and Japan. For each country, there are certain areas where the population density is concentrated and which are connected to each other. These areas are important for each country to consider since they can generate development and economic activity. The first graph also shows the areas where the population density is lowest. Using this visualization, governments can target their projects and aid to those important areas. From this point of view, we can already infer that the population density in the Philippines is much greater than in South Korea and Japan.

Machine Learning Models:

JAPAN

The machine learning models created for predicting Japan's population density are Linear Regression, Decision Tree Regressor, and Random Forest Regressor. Based on the results, the highest accuracy is achieved by the Decision Tree Regressor. For this research, linear regression serves as the baseline model for all countries. The accuracy of the linear regression is very low, which suggests that latitude and longitude are not linearly correlated with the population density of Japan. The same holds for the demography feature of the Japan dataset. This is supported by a training accuracy of about 11% and a test accuracy of around 10%. The root mean square error of the linear regression is also quite large relative to the values being predicted. For the Japan dataset, the Decision Tree Regressor reaches almost 99% accuracy on both the training set and the test set. The Random Forest Regressor is a little lower, at 93%.

SOUTH KOREA

The machine learning models created for predicting South Korea's population density are Linear Regression, Decision Tree Regressor, and Random Forest Regressor. Based on the results, the highest accuracy is achieved by the Decision Tree Regressor. As with Japan, the accuracy of the linear regression is very low, which suggests that latitude and longitude are not linearly correlated with the population density of South Korea. The same holds for the demography feature of the South Korea dataset. This is supported by a training accuracy of about 13% and a test accuracy of around 15%. The root mean square error of the linear regression is also quite large relative to the values being predicted. For the South Korea dataset, the Decision Tree Regressor reaches almost 97% accuracy on the training set and 87% on the test set. The Random Forest Regressor is a little lower.

PHILIPPINES

The machine learning models created for predicting the Philippines' population density are Linear Regression, Decision Tree Regressor, and Random Forest Regressor. Based on the results, the highest accuracy is achieved by the Decision Tree Regressor. As with Japan and South Korea, the accuracy of the linear regression is very low, and it is lower still than in those countries, which suggests that latitude and longitude are not linearly correlated with the population density of the Philippines. The same holds for the demography feature of the Philippines dataset. This is supported by a training accuracy of about 2% and a test accuracy of around 2%. The root mean square error of the linear regression is also quite large relative to the values being predicted. For the Philippines dataset, the Decision Tree Regressor reaches almost 36% accuracy on the test set, which suggests that the population density data for the Philippines is not as well optimized as that of the other countries. The Random Forest Regressor is a little lower than the Decision Tree Regressor.

CONCLUSION AND RECOMMENDATION

A dataset from the AWS Open Data Registry, the High Resolution Population Density Maps dataset, was parsed and underwent data analysis, data preprocessing, and further filtering. The problem statement asks whether population density can be predicted using the latitude, longitude, and demography features of the dataset. Several machine learning models were created to determine whether the problem is answerable and to compare their results in order to reach a sound conclusion.

The results show the following conclusions:

  1. The Japan dataset achieved 99% accuracy in predicting the population density using the Decision Tree model.
  2. The South Korea dataset achieved 95% accuracy in predicting the population density using the Random Forest model.
  3. The Philippines dataset achieved 73% accuracy in predicting the population density using the Decision Tree model.

Additional recommendations are also given to further improve the research. It is recommended to augment this population density dataset with information from other datasets, to provide more context or to serve as supplementary features. It is also recommended that the dataset be run through different neural network models.

REFERENCES AND ACKNOWLEDGEMENTS

[1] Yegorov, Yuri. (2015). Economic Role of Population Density. https://www.researchgate.net/publication/283637652_Economic_Role_of_Population_Density Accessed 29 Nov 2020.

[2] Facebook Connectivity Lab and Center for International Earth Science Information Network – CIESIN – Columbia University. 2016. High Resolution Settlement Layer (HRSL). Source imagery for HRSL © 2016 DigitalGlobe. https://dataforgood.fb.com/docs/high-resolution-population-density-maps-demographic-estimates-documentation/ Accessed 29 Nov 2020.